Whats Covered

Aditional Resources


Case study


Lahman’s Baseball Database

  • library(Lahman)
  • Sean Lahman’s baseball database
  • 26 tables
  • A great example of real worl relational data

This database contains pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2016. It includes data from the two current leagues (American and National), the four other “major” leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875.

 This database was created by Sean Lahman, who pioneered the effort
 to make baseball statistics freely available to the general
 public. What started as a one man effort in 1994 has grown
 tremendously, and now a team of researchers have collected their
 efforts to make this the largest and most accurate source for
 baseball statistics available anywhere.

 This database, in the form of an R package offers a variety of
 interesting challenges and opportunities for data processing and
 visualization in R.

 In the current version, the examples make extensive use of the
 'dplyr' package for data manipulation (tabulation, queries,
 summaries, merging, etc.), reflecting the original relational
 database design and 'ggplot2' for graphics.

 **Details:**

   Package:   Lahman                 
   Type:      Package                
   Version:   6.0-0                  
   Date:      2017-07-04             
   License:   GPL version 2 or newer 
   LazyLoad:  yes                    
   LazyData:  yes                    
  
 The main form of this database is a relational database in
 Microsoft Access format. The design follows these general
 principles: Each player is assigned a unique code ('playerID').
 All of the information in different tables relating to that player
 is tagged with his 'playerID'.  The 'playerID's are linked to
 names and birthdates in the 'Master' table.  Similar links exist
 among other tables via analogous '*ID' variables.

 The database is composed of the following main tables:

 'Master' Player names, dates of birth, death and other
      biographical info

 'Batting' batting statistics

 'Pitching' pitching statistics

 'Fielding' fielding statistics

 A collection of other tables is also provided:

 Teams:

             'Teams'  yearly stats and standings  
         'TeamsHalf'  split season data for teams 
   'TeamsFranchises'  franchise information       
  
 Post-season play:

    'BattingPost'  post-season batting statistics  
   'PitchingPost'  post-season pitching statistics 
   'FieldingPost'  post-season fielding data       
     'SeriesPost'  post-season series information  
  
 Awards:

        'AwardsManagers'  awards won by managers          
         'AwardsPlayers'  awards won by players           
   'AwardsShareManagers'  award voting for manager awards 
    'AwardsSharePlayers'  award voting for player awards  
  
 Hall of Fame: links to Master via 'hofID'

   'HallOfFame'  Hall of Fame voting data 
  
 Other tables:

 'AllstarFull' - All-Star games appearances; 'Managers' -
 managerial statistics; 'FieldingOF' - outfield position data;
 'ManagersHalf' - split season data for managers; 'Salaries' -
 player salary data; 'Appearances' - data on player appearances;
 'Schools' - Information on schools players attended;
 'CollegePlaying' - Information on schools players attended, by
 player and year;

 Variable label tables are provided for some of the tables:

 'battingLabels', 'pitchingLabels', 'fieldingLabels'

Universal keys?

  • Before starting to analyze the data, you’ll examine how the datasets are related to each other.
  • In particular, do any variable names span all of the datasets?
library(purrr)

# Examine lahmanNames
lahmanNames
## $AllstarFull
## # A tibble: 8 x 1
##   var        
##   <chr>      
## 1 playerID   
## 2 yearID     
## 3 gameNum    
## 4 gameID     
## 5 teamID     
## 6 lgID       
## 7 GP         
## 8 startingPos
## 
## $Appearances
## # A tibble: 21 x 1
##    var      
##    <chr>    
##  1 yearID   
##  2 teamID   
##  3 lgID     
##  4 playerID 
##  5 G_all    
##  6 GS       
##  7 G_batting
##  8 G_defense
##  9 G_p      
## 10 G_c      
## # ... with 11 more rows
## 
## $AwardsManagers
## # A tibble: 6 x 1
##   var     
##   <chr>   
## 1 playerID
## 2 awardID 
## 3 yearID  
## 4 lgID    
## 5 tie     
## 6 notes   
## 
## $AwardsPlayers
## # A tibble: 6 x 1
##   var     
##   <chr>   
## 1 playerID
## 2 awardID 
## 3 yearID  
## 4 lgID    
## 5 tie     
## 6 notes   
## 
## $AwardsShareManagers
## # A tibble: 7 x 1
##   var       
##   <chr>     
## 1 awardID   
## 2 yearID    
## 3 lgID      
## 4 playerID  
## 5 pointsWon 
## 6 pointsMax 
## 7 votesFirst
## 
## $AwardsSharePlayers
## # A tibble: 7 x 1
##   var       
##   <chr>     
## 1 awardID   
## 2 yearID    
## 3 lgID      
## 4 playerID  
## 5 pointsWon 
## 6 pointsMax 
## 7 votesFirst
## 
## $Batting
## # A tibble: 22 x 1
##    var     
##    <chr>   
##  1 playerID
##  2 yearID  
##  3 stint   
##  4 teamID  
##  5 lgID    
##  6 G       
##  7 AB      
##  8 R       
##  9 H       
## 10 X2B     
## # ... with 12 more rows
## 
## $battingLabels
## # A tibble: 2 x 1
##   var     
##   <chr>   
## 1 variable
## 2 label   
## 
## $BattingPost
## # A tibble: 22 x 1
##    var     
##    <chr>   
##  1 yearID  
##  2 round   
##  3 playerID
##  4 teamID  
##  5 lgID    
##  6 G       
##  7 AB      
##  8 R       
##  9 H       
## 10 X2B     
## # ... with 12 more rows
## 
## $CollegePlaying
## # A tibble: 3 x 1
##   var     
##   <chr>   
## 1 playerID
## 2 schoolID
## 3 yearID  
## 
## $Fielding
## # A tibble: 18 x 1
##    var     
##    <chr>   
##  1 playerID
##  2 yearID  
##  3 stint   
##  4 teamID  
##  5 lgID    
##  6 POS     
##  7 G       
##  8 GS      
##  9 InnOuts 
## 10 PO      
## 11 A       
## 12 E       
## 13 DP      
## 14 PB      
## 15 WP      
## 16 SB      
## 17 CS      
## 18 ZR      
## 
## $fieldingLabels
## # A tibble: 2 x 1
##   var     
##   <chr>   
## 1 variable
## 2 label   
## 
## $FieldingOF
## # A tibble: 6 x 1
##   var     
##   <chr>   
## 1 playerID
## 2 yearID  
## 3 stint   
## 4 Glf     
## 5 Gcf     
## 6 Grf     
## 
## $FieldingPost
## # A tibble: 17 x 1
##    var     
##    <chr>   
##  1 playerID
##  2 yearID  
##  3 teamID  
##  4 lgID    
##  5 round   
##  6 POS     
##  7 G       
##  8 GS      
##  9 InnOuts 
## 10 PO      
## 11 A       
## 12 E       
## 13 DP      
## 14 TP      
## 15 PB      
## 16 SB      
## 17 CS      
## 
## $HallOfFame
## # A tibble: 9 x 1
##   var        
##   <chr>      
## 1 playerID   
## 2 yearID     
## 3 votedBy    
## 4 ballots    
## 5 needed     
## 6 votes      
## 7 inducted   
## 8 category   
## 9 needed_note
## 
## $LahmanData
## # A tibble: 5 x 1
##   var  
##   <chr>
## 1 file 
## 2 class
## 3 nobs 
## 4 nvar 
## 5 title
## 
## $Managers
## # A tibble: 10 x 1
##    var     
##    <chr>   
##  1 playerID
##  2 yearID  
##  3 teamID  
##  4 lgID    
##  5 inseason
##  6 G       
##  7 W       
##  8 L       
##  9 rank    
## 10 plyrMgr 
## 
## $ManagersHalf
## # A tibble: 10 x 1
##    var     
##    <chr>   
##  1 playerID
##  2 yearID  
##  3 teamID  
##  4 lgID    
##  5 inseason
##  6 half    
##  7 G       
##  8 W       
##  9 L       
## 10 rank    
## 
## $Master
## # A tibble: 26 x 1
##    var         
##    <chr>       
##  1 playerID    
##  2 birthYear   
##  3 birthMonth  
##  4 birthDay    
##  5 birthCountry
##  6 birthState  
##  7 birthCity   
##  8 deathYear   
##  9 deathMonth  
## 10 deathDay    
## # ... with 16 more rows
## 
## $Pitching
## # A tibble: 30 x 1
##    var     
##    <chr>   
##  1 playerID
##  2 yearID  
##  3 stint   
##  4 teamID  
##  5 lgID    
##  6 W       
##  7 L       
##  8 G       
##  9 GS      
## 10 CG      
## # ... with 20 more rows
## 
## $pitchingLabels
## # A tibble: 2 x 1
##   var     
##   <chr>   
## 1 variable
## 2 label   
## 
## $PitchingPost
## # A tibble: 30 x 1
##    var     
##    <chr>   
##  1 playerID
##  2 yearID  
##  3 round   
##  4 teamID  
##  5 lgID    
##  6 W       
##  7 L       
##  8 G       
##  9 GS      
## 10 CG      
## # ... with 20 more rows
## 
## $Salaries
## # A tibble: 5 x 1
##   var     
##   <chr>   
## 1 yearID  
## 2 teamID  
## 3 lgID    
## 4 playerID
## 5 salary  
## 
## $Schools
## # A tibble: 5 x 1
##   var      
##   <chr>    
## 1 schoolID 
## 2 name_full
## 3 city     
## 4 state    
## 5 country  
## 
## $SeriesPost
## # A tibble: 9 x 1
##   var         
##   <chr>       
## 1 yearID      
## 2 round       
## 3 teamIDwinner
## 4 lgIDwinner  
## 5 teamIDloser 
## 6 lgIDloser   
## 7 wins        
## 8 losses      
## 9 ties        
## 
## $Teams
## # A tibble: 48 x 1
##    var     
##    <chr>   
##  1 yearID  
##  2 lgID    
##  3 teamID  
##  4 franchID
##  5 divID   
##  6 Rank    
##  7 G       
##  8 Ghome   
##  9 W       
## 10 L       
## # ... with 38 more rows
## 
## $TeamsFranchises
## # A tibble: 4 x 1
##   var       
##   <chr>     
## 1 franchID  
## 2 franchName
## 3 active    
## 4 NAassoc   
## 
## $TeamsHalf
## # A tibble: 10 x 1
##    var   
##    <chr> 
##  1 yearID
##  2 lgID  
##  3 teamID
##  4 Half  
##  5 divID 
##  6 DivWin
##  7 Rank  
##  8 G     
##  9 W     
## 10 L
# Find variables in common
reduce(lahmanNames, intersect)
## # A tibble: 0 x 1
## # ... with 1 variable: var <chr>

No variables span all 26 datasets.

Common keys

No variables span all 26 datasets, but several variables span more than one dataset. These variables provide relationships between the various tables in Lahman. Can you find them?

lahmanNames %>%  
  # Bind the data frames in lahmanNames
  bind_rows(.id = 'dataframe') %>%
  # Group the result by var
  group_by(var) %>%
  # Tally the number of appearances
  tally() %>%
  # Filter the data
  filter(n > 2) %>% 
  # Arrange the results
  arrange(desc(n)) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
var n
yearID 21
playerID 19
lgID 17
teamID 13
G 10
L 6
W 6
BB 5
CS 5
GS 5
H 5
HBP 5
HR 5
R 5
SB 5
SF 5
SO 5
awardID 4
GIDP 4
IBB 4
round 4
SH 4
stint 4
AB 3
CG 3
DP 3
E 3
ER 3
ERA 3
IPouts 3
label 3
SHO 3
SV 3
variable 3
WP 3
X2B 3
X3B 3

yearID, playerID, lgID, and teamID are the most common variable names.

playerID

In the last exercise, you saw that playerID is one of the most widely used keys in the Lahman database. Which datasets use playerID?

lahmanNames %>% 
  # Bind the data frames
  bind_rows(.id = 'dataframe') %>%
  # Filter the results
  filter(var == "playerID") %>% 
  # Extract the dataframe variable
  `$`(dataframe)
##  [1] "AllstarFull"         "Appearances"         "AwardsManagers"     
##  [4] "AwardsPlayers"       "AwardsShareManagers" "AwardsSharePlayers" 
##  [7] "Batting"             "BattingPost"         "CollegePlaying"     
## [10] "Fielding"            "FieldingOF"          "FieldingPost"       
## [13] "HallOfFame"          "Managers"            "ManagersHalf"       
## [16] "Master"              "Pitching"            "PitchingPost"       
## [19] "Salaries"

The playerID variable appears in 19 of the 26 datasets in Lahman.

Salaries

Who are the players?

Before we look at salaries, let’s begin by ensuring that we have salary information for each player in the database, or at least no systematic holes in our coverage.

To do this we will need a list of every player to compare against salaries. The Master dataset contains all of the players in the database, but it may contain multiple rows for each player.

How should you proceed?

We saw how to use distinct() to find unique rows for all columns in a table. You can also find unique rows for specific columns using thefollowing syntax:

tbl %>%
  # Find unique rows of columns a,b, and c
  distinct(a, b, c)
library(Lahman)
players <- Master %>% 
  # Return one row for each distinct player
  distinct(playerID, nameFirst, nameLast)

players %>% 
  head(10) %>%
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
playerID nameFirst nameLast
aardsda01 David Aardsma
aaronha01 Hank Aaron
aaronto01 Tommie Aaron
aasedo01 Don Aase
abadan01 Andy Abad
abadfe01 Fernando Abad
abadijo01 John Abadie
abbated01 Ed Abbaticchio
abbeybe01 Bert Abbey
abbeych01 Charlie Abbey

players contains a concise list of player IDs and names. Plus we can feel certain that players contains only one row for each distinct player.

Missing salaries

Now that we know who the players are, let’s check our salary coverage. How many players are completely missing salary information?

Salaries %>% 
  head(10) %>%
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
yearID teamID lgID playerID salary
1985 ATL NL barkele01 870000
1985 ATL NL bedrost01 550000
1985 ATL NL benedbr01 545000
1985 ATL NL campri01 633333
1985 ATL NL ceronri01 625000
1985 ATL NL chambch01 800000
1985 ATL NL dedmoje01 150000
1985 ATL NL forstte01 483333
1985 ATL NL garbege01 772000
1985 ATL NL harpete01 250000
players %>% 
  # Find all players who do not appear in Salaries
  anti_join(Salaries, by = "playerID") %>%
  # Count them
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1 14468

We are missing the salaries for 14,468 players. No wonder baseball players strike so often!

Unpaid games?

Now that we know there is a huge hole in the salary data, let’s see if we can explain it. Is it possible that these players somehow did not play (and hence did not earn a salary)?

We can check with the Appearances data frame. Appearances contains information about every game played in major league baseball. That is, if a player played a game, it would show up as a row in Appearances.

Appearances %>% 
  head(10) %>%
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
yearID teamID lgID playerID G_all GS G_batting G_defense G_p G_c G_1b G_2b G_3b G_ss G_lf G_cf G_rf G_of G_dh G_ph G_pr
1871 TRO NA abercda01 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0
1871 RC1 NA addybo01 25 25 25 25 0 0 0 22 0 3 0 0 0 0 0 0 0
1871 CL1 NA allisar01 29 29 29 29 0 0 0 2 0 0 0 29 0 29 0 0 0
1871 WS3 NA allisdo01 27 27 27 27 0 27 0 0 0 0 0 0 0 0 0 0 0
1871 RC1 NA ansonca01 25 25 25 25 0 5 1 2 20 0 1 0 0 1 0 0 0
1871 FW1 NA armstbo01 12 12 12 12 0 0 0 0 0 0 0 11 1 12 0 0 0
1871 RC1 NA barkeal01 1 1 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0
1871 BS1 NA barnero01 31 31 31 31 0 0 0 16 0 15 0 0 0 0 0 0 0
1871 FW1 NA barrebi01 1 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0
1871 BS1 NA barrofr01 18 17 18 18 0 0 0 1 0 0 13 0 4 17 0 0 0
players %>% 
  anti_join(Salaries, by = "playerID") %>% 
  # How many unsalaried players appear in Appearances?
  semi_join(Appearances, by = "playerID") %>% 
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1 14279

14,279 players played a game but are missing salary information. Interestingly, 191 players neither played a game nor have a recorded salary.

How many games?

Perhaps the unsalaried players only played one or two games, and hence did not earn a full salary. Can you determine how many games each of these unsalaried players played?

players %>% 
  # Find all players who do not appear in Salaries
  anti_join(Salaries, by = "playerID") %>% 
  # Join them to Appearances
  left_join(Appearances, by = "playerID") %>% 
  # Calculate total_games for each player
  group_by(playerID) %>%
  summarize(total_games=sum(G_all, is.na = FALSE)) %>%
  # Arrange in descending order by total_games
  arrange(desc(total_games))
## # A tibble: 14,468 x 2
##    playerID  total_games
##    <chr>           <int>
##  1 yastrca01        3308
##  2 aaronha01        3298
##  3 cobbty01         3034
##  4 musiast01        3026
##  5 mayswi01         2992
##  6 robinbr01        2896
##  7 kalinal01        2834
##  8 collied01        2825
##  9 robinfr02        2808
## 10 wagneho01        2797
## # ... with 14,458 more rows

Many of these players appeared in thousands of games, which rules out our hypothesis that they did not appear in enough games to earn a salary.

How many at-bats?

Is it possible that the unsalaried players did not actually play in the games that they appeared in? One way to check would be to determine if the players had an at-bat (i.e. batted) in the games that they appeared in.

players %>%
  # Find unsalaried players
  anti_join(Salaries, by = "playerID") %>% 
  # Join Batting to the unsalaried players
  left_join(Batting, by = "playerID") %>% 
  # Group by player
  group_by(playerID) %>% 
  # Sum at-bats for each player
  summarize(total_at_bat = sum(AB, is.na = FALSE)) %>% 
  # Arrange in descending order
  arrange(desc(total_at_bat))
## # A tibble: 14,468 x 2
##    playerID  total_at_bat
##    <chr>            <int>
##  1 aaronha01        12364
##  2 yastrca01        11988
##  3 cobbty01         11435
##  4 musiast01        10972
##  5 mayswi01         10881
##  6 robinbr01        10654
##  7 wagneho01        10439
##  8 brocklo01        10332
##  9 ansonca01        10281
## 10 aparilu01        10230
## # ... with 14,458 more rows

The unpaid players definitely participated in the games. In fact, you can spot Ty Cobb and Hank Aaron in the top three at-bats. I think it’s safe to assume that you are dealing with missing data here and not unsalaried players.

Introducing the hall of fame

Hall of fame nominations

The Hall of Fame is a collection of distinguished baseball players selected by a committee of baseball experts. As with any hall of fame, more players are nominated for membership than are actually admitted.

  • Let’s see how many players have been nominated for the Hall of Fame.
# Find the distinct players that appear in HallOfFame
nominated <- HallOfFame %>% 
  distinct(playerID)

nominated %>% 
  # Count the number of players in nominated
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1279
nominated_full <- nominated %>% 
  # Join to Master
  left_join(Master, by = "playerID") %>% 
  # Return playerID, nameFirst, nameLast
  select(playerID, nameFirst, nameLast)

There were 1,279 nominees for the Hall of Fame. We now have a dataset of everyone nominated! Let’s create a dataset for everyone inducted to the Hall of Fame in the next exercise.

Hall of fame inductions

In the previous exercise, we saw that 1,279 players were nominated for the hall of fame. Let’s now see how many players were admitted to the hall of fame to examine how selective the voting process is.

# Find distinct players in HallOfFame with inducted == "Y"
inducted <- HallOfFame %>% 
  filter(inducted == "Y") %>% 
  distinct(playerID)

inducted %>% 
  # Count the number of players in inducted
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1   323
inducted_full <- inducted %>% 
  # Join to Master
  left_join(Master, by = "playerID") %>% 
  # Return playerID, nameFirst, nameLast
  select(playerID, nameFirst, nameLast)

323 players have been inducted into the Hall of Fame out of 1,279 nominees. We now also have datasets of everyone inducted and everyone nominated.

Awards

Now that we know who was inducted and who was nominated, let’s examine what separates the nominees who were inducted from the nominees who were not.

Let’s start with a simple question: Did nominees who were inducted earn more awards than nominees who were not inducted?

We can use AwardsPlayers to answer the question. It lists the playerID’s of players who won baseball awards, and it contains one row for each award awarded in major league baseball.

AwardsPlayers %>% 
  head(10) %>%
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
playerID awardID yearID lgID tie notes
bondto01 Pitching Triple Crown 1877 NL NA NA
hinespa01 Triple Crown 1878 NL NA NA
heckegu01 Pitching Triple Crown 1884 AA NA NA
radboch01 Pitching Triple Crown 1884 NL NA NA
oneilti01 Triple Crown 1887 AA NA NA
keefeti01 Pitching Triple Crown 1888 NL NA NA
clarkjo01 Pitching Triple Crown 1889 NL NA NA
rusieam01 Pitching Triple Crown 1894 NL NA NA
duffyhu01 Triple Crown 1894 NL NA NA
youngcy01 Pitching Triple Crown 1901 AL NA NA
# Tally the number of awards in AwardsPlayers by playerID
nAwards <- AwardsPlayers %>% 
  group_by(playerID) %>% 
  tally()

nAwards %>% 
  # Filter to just the players in inducted 
  inner_join(inducted, by = "playerID") %>% 
  # Calculate the mean number of awards per player
  summarize(avg_n = mean(n, na.rm = TRUE))
## # A tibble: 1 x 1
##   avg_n
##   <dbl>
## 1  12.0
nAwards %>% 
  # Filter to just the players in nominated 
  inner_join(nominated, by = "playerID") %>% 
  # Filter to players NOT in inducted 
  anti_join(inducted, by = "playerID") %>% 
  # Calculate the mean number of awards per player
  summarize(avg_n = mean(n, na.rm = TRUE))
## # A tibble: 1 x 1
##   avg_n
##   <dbl>
## 1  4.21

On Average, inductees had 12.04 - 4.21 = 7.83 more awards than non-inductees.

Salary

Salary may provide another way to differentiate inductees from non-inductees. Does the maximum salary earned by inductees tend to be greater than the maximum salary earned by nominees who were not inducted?

# Find the players who are in nominated, but not inducted
notInducted <- nominated %>% 
  setdiff(inducted)

Salaries %>% 
  # Find the players who are in notInducted
  semi_join(notInducted, by = "playerID") %>%
  # Calculate the max salary by player
  group_by(playerID) %>% 
  summarize(max_salary = max(salary, na.rm = TRUE)) %>% 
  # Calculate the average of the max salaries
  summarize(avg_salary = mean(max_salary, na.rm = TRUE))
## # A tibble: 1 x 1
##   avg_salary
##        <dbl>
## 1   5453789.
# Repeat for players who were inducted
Salaries %>% 
  semi_join(inducted, by = "playerID") %>% 
  group_by(playerID) %>% 
  summarize(max_salary = max(salary, na.rm = TRUE)) %>% 
  summarize(avg_salary = mean(max_salary, na.rm = TRUE))
## # A tibble: 1 x 1
##   avg_salary
##        <dbl>
## 1   6666063.

It turns out that the average salary of players who were inducted was $5,079,720 - $4,677,737 = $401,983 more per year. We know that we have some missing salary information, but what else could be wrong with this analysis?

Retirement

One of the rules of the Hall of Fame is that players cannot be nominated until five years after they retire. Is this reflected in our data?

Appearances %>% 
  # Filter Appearances against nominated
  semi_join(nominated, by = "playerID") %>% 
  # Find last year played by player
  group_by(playerID) %>% 
  summarize(last_year = max(yearID)) %>% 
  # Join to full HallOfFame
  left_join(HallOfFame, by = "playerID") %>% 
  # Filter for unusual observations
  filter(yearID <= last_year)
## # A tibble: 39 x 10
##    playerID last_year yearID votedBy ballots needed votes inducted category
##    <chr>        <dbl>  <int> <chr>     <int>  <int> <int> <fct>    <fct>   
##  1 cissebi~      1938   1937 BBWAA       201    151     1 N        Player  
##  2 cochrmi~      1937   1936 BBWAA       226    170    80 N        Player  
##  3 deandi01      1947   1945 BBWAA       247    186    17 N        Player  
##  4 deandi01      1947   1946 Final ~     263    198    45 N        Player  
##  5 deandi01      1947   1946 Nomina~     202     NA    40 N        Player  
##  6 deandi01      1947   1947 BBWAA       161    121    88 N        Player  
##  7 dickebi~      1946   1945 BBWAA       247    186    17 N        Player  
##  8 dickebi~      1946   1946 Nomina~     202     NA    40 N        Player  
##  9 dickebi~      1946   1946 Final ~     263    198    32 N        Player  
## 10 dimagjo~      1951   1945 BBWAA       247    186     1 N        Player  
## # ... with 29 more rows, and 1 more variable: needed_note <chr>

It looks like quite a few players have been nominated before they retired, but this practice seems much less frequent in recent years.

Session info

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252   
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C                       
## [5] LC_TIME=German_Switzerland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] purrr_0.3.0      ggplot2_3.1.0    dplyr_0.8.0.1    gapminder_0.3.0 
## [5] kableExtra_1.0.1 knitr_1.21      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0        highr_0.7         plyr_1.8.4       
##  [4] pillar_1.3.1      compiler_3.5.2    prettydoc_0.2.1  
##  [7] tools_3.5.2       digest_0.6.18     gtable_0.2.0     
## [10] evaluate_0.12     tibble_2.0.1      viridisLite_0.3.0
## [13] pkgconfig_2.0.2   rlang_0.3.1       cli_1.0.1        
## [16] rstudioapi_0.9.0  yaml_2.2.0        xfun_0.4         
## [19] withr_2.1.2       httr_1.4.0        stringr_1.4.0    
## [22] xml2_1.2.0        hms_0.4.2         webshot_0.5.1    
## [25] grid_3.5.2        tidyselect_0.2.5  glue_1.3.0       
## [28] R6_2.4.0          fansi_0.4.0       rmarkdown_1.11   
## [31] readr_1.3.1       magrittr_1.5      codetools_0.2-15 
## [34] scales_1.0.0      htmltools_0.3.6   assertthat_0.2.0 
## [37] rvest_0.3.2       colorspace_1.4-0  utf8_1.1.4       
## [40] stringi_1.3.1     lazyeval_0.2.1    munsell_0.5.0    
## [43] crayon_1.3.4